search space
012a91467f210472fab4e11359bbfef6-AuthorFeedback.pdf
First, as R4 suggested, "symbolic35 tree" was more approachable for people in the ML community. Second, the symbolic tree is declared by the user using36 decorators and serves to represent high-level program constructs, which is different from the AST that represents all37 the syntactic structures for the program. For example, the full Python AST contains information about objects' class38 methods, whereas our symbolic representation does not.39 R4: "Second, most of their tool/language design could be summarized as adding some kind of non determinis-40 tic/parametric choice ... It's extension to ML does not introduce anything particularly new ..."41 We agree with R4 that symbolic programming and non-deterministic programming are well-studied topics in the PL42 community. However, we would like to emphasize that this work is the first to introduce such concepts to AutoML43 to significantly reduce engineering effort, which is a novel and useful contribution. For example, PyGlove leverages44 symbolic manipulation to decouple the search algorithm, search space and child program, which enabled us to unify45 the interface among search methods with and without weight sharing. To enable symbolic programming in Python,46 PyGlove implements an object model for maintaining the consistency of program state during symbolic manipulation.47 R4 "Provide the grammar in the main text"48 We understand the "grammar" here as a reference to the formal definition of the search space specification. We will49 revise current Appendix Table 3 into a formal definition, and add it to the "search space" sub-section.50
MCUFormer: Deploying Vision Transformers on Microcontrollers with Limited Memory
Due to the high price and heavy energy consumption of GPUs, deploying deep models on IoT devices such as microcontrollers makes significant contributions for ecological AI. Conventional methods successfully enable convolutional neural network inference of high resolution images on microcontrollers, while the framework for vision transformers that achieve the state-of-the-art performance in many vision applications still remains unexplored. In this paper, we propose a hardware-algorithm co-optimizations method called MCUFormer to deploy vision transformers on microcontrollers with extremely limited memory, where we jointly design transformer architecture and construct the inference operator library to fit the memory resource constraint. More specifically, we generalize the one-shot network architecture search (NAS) to discover the optimal architecture with highest task performance given the memory budget from the microcontrollers, where we enlarge the existing search space of vision transformers by considering the low-rank decomposition dimensions and patch resolution for memory reduction. For the construction of the inference operator library of vision transformers, we schedule the memory buffer during inference through operator integration, patch embedding decomposition, and token overwriting, allowing the memory buffer to be fully utilized to adapt to the forward pass of the vision transformer. Experimental results demonstrate that our MCUFormer achieves 73.62% top-1 accuracy on ImageNet for image classification with 320KB memory on STM32F746 microcontroller.
Zero-shot causal learning
Predicting how different interventions will causally affect a specific individual is important in a variety of domains such as personalized medicine, public policy, and online marketing. There are a large number of methods to predict the effect of an existing intervention based on historical data from individuals who received it. However, in many settings it is important to predict the effects of novel interventions (e.g., a newly invented drug), which these methods do not address. Here, we consider zero-shot causal learning: predicting the personalized effects of a novel intervention. We propose CaML, a causal meta-learning framework which formulates the personalized prediction of each intervention's effect as a task. CaML trains a single meta-model across thousands of tasks, each constructed by sampling an intervention, its recipients, and its nonrecipients. By leveraging both intervention information (e.g., a drug's attributes) and individual features (e.g., a patient's history), CaML is able to predict the personalized effects of novel interventions that do not exist at the time of training. Experimental results on real world datasets in large-scale medical claims and cell-line perturbations demonstrate the effectiveness of our approach. Most strikingly, CaML's zero-shot predictions outperform even strong baselines trained directly on data from the test interventions.
SCOPE-FE: Structured Control of Operator and Pairwise Exploration for Feature Engineering
Park, Minhee, Son, Seongyeon, Lee, Yonghyun, Kim, Eunchan
Automatic feature engineering is an effective approach for improving predictive performance in tabular learning. However, expand-and-reduce methods, such as OpenFE, become increasingly computationally expensive as the input dimensionality grows. This limitation arises primarily from the combinatorial explosion of candidate features generated through operator-feature combinations. To address this issue, we propose SCOPE-FE, a structured search space control framework that improves efficiency by reducing the candidate space prior to feature generation. SCOPE-FE jointly regulates two major sources of combinatorial growth: the operator space and feature-pair space. First, OperatorProbing estimates the dataset-specific utility of candidate operators and eliminates low-contribution operators in advance. Second, FeatureClustering employs spectral embedding and fuzzy c-means clustering to group structurally related features, thereby restricting candidate generation to relevant within-cluster combinations. In addition, we introduce ReliabilityScoring, which incorporates variance across subsamples to stabilize pruning decisions. Experiments on ten benchmark datasets demonstrate that SCOPE-FE substantially reduces feature engineering time while maintaining competitive predictive performance relative to existing baselines. The efficiency gains are particularly pronounced for high-dimensional datasets. These results indicate that structured control of the search space is an effective strategy for scalable automatic feature engineering. The code will be made publicly available upon acceptance.
Harnessing the Power of Choices in Decision Tree Learning
We propose a simple generalization of standard and empirically successful decision tree learning algorithms such as ID3, C4.5, and CART. These algorithms, which have been central to machine learning for decades, are greedy in nature: they grow a decision tree by iteratively splitting on the best attribute. Our algorithm, Top-k, considers the k best attributes as possible splits instead of just the single best attribute.We demonstrate, theoretically and empirically, the power of this simple generalization. We first prove a greediness hierarchy theorem showing that for every k N, Top-(k +1) can be dramatically more powerful than Top-k: there are data distributions for which the former achieves accuracy 1 ε, whereas the latter only achieves accuracy 12 +ε. We then show, through extensive experiments, that Top-k outperforms the two main approaches to decision tree learning: classic greedy algorithms and more recent "optimal decision tree" algorithms. On one hand, Top-k consistently enjoys significant accuracy gains over greedy algorithms across a wide range of benchmarks. On the other hand, Top-k is markedly more scalable than optimal decision tree algorithms and is able to handle dataset and feature set sizes that remain far beyond the reach of these algorithms.
AutoGO: Automated Computation Graph Optimization for Neural Network Evolution
Optimizing Deep Neural Networks (DNNs) to obtain high-quality models for efficient real-world deployment has posed multi-faceted challenges to machine learning engineers. Existing methods either search for neural architectures in heuristic design spaces or apply low-level adjustments to computation primitives to improve inference efficiency on hardware. We present Automated Graph Optimization (AutoGO), a framework to evolve neural networks in a low-level Computation Graph (CG) of primitive operations to improve both its performance and hardware friendliness. Through a tokenization scheme, AutoGO performs variable-sized segment mutations, making both primitive changes and larger-grained changes to CGs. We introduce our segmentation and mutation algorithms, efficient frequent segment mining technique, as well as a pretrained context-aware predictor to estimate the impact of segment replacements. Extensive experimental results show that AutoGO can automatically evolve several typical large convolutional networks to achieve significant task performance improvement and FLOPs reduction on a range of CV tasks, ranging from Classification, Semantic Segmentation, Human Pose Estimation, to Super Resolution, yet without introducing any newer primitive operations. We also demonstrate the lightweight deployment results of AutoGOoptimized super-resolution and denoising U-Nets on a cycle simulator for a Neural Processing Unit (NPU), achieving PSNR improvement and latency/power reduction simultaneously.
Unsupervised Graph Neural Architecture Search with Disentangled Self-supervision (Appendix)
B.1 Complexity Analysis Denote the number of nodes and edges in the graph as N and E, the number of latent factors as K, the number of operation choices as |O|, the dimensionality of hidden representations as d. The time complexity of the disentangled super-network is O(K|E|d+K|V|d2), where the computation for each factor is fully parallelizable and amenable to GPU acceleration, and K is usually a small constant. The time complexity of the self-supervised training and contrastive search modules is both O(K2d2). As architectures under different factors share the parameters, the number of learnable parameters is the same as classical graph super-network, i.e., O(|O|d2). Therefore, the complexity of our method is comparable to classical GNAS methods.